ABSTRACT

Record linkage has been extensively used in various data mining 
applications involving sharing data. While the amount of avail- 
able data is growing, the concern of disclosing sensitive informa- 
tion poses the problem of utility vs privacy. In this paper, we study 
the problem of private record linkage via secure data transforma- 
tions. In contrast to the existing techniques in this area, we pro- 
pose a novel approach that provides strong privacy guarantees un- 
der the formal framework of differential privacy. We develop an 
embedding strategy based on frequent variable length grams mined 
in a private way from the original data. We also introduce a
personalized threshold for matching individual records in the embedded
space, which achieves better linkage accuracy than the existing
global threshold approach. Compared with the state-of-the-art secure
matching schema [23], our approach provides formal, provable
privacy guarantees and achieves better scalability while providing
comparable utility.

1. INTRODUCTION 

Record linkage [28, 10] plays a central role in many data integration
and data mining tasks that involve data from multiple sources.
It is the process of identifying records that refer to the same real 
world entity across different sources. It is extensively used in many 
applications, for example, in linking medical data of the same pa- 
tient across different hospitals in the country or in collecting the 
credit history of users from several sources. However, much of
this data may contain sensitive personal information whose disclosure
could compromise individual privacy. For this reason, the problem of
privacy preserving record linkage has drawn considerable attention over
recent years. The objective is to allow two parties to identify records
that are close to each other according to some distance function, 
such that no additional information about the data records other 
than the result is disclosed to any party. 

In the existing literature, several privacy models and techniques
have been proposed, and they fall into three main categories:
secure transformation [1, 7, 23, 24], Secure Multiparty
Computation (SMC) [19, 31], and hybrid methods [11, 14, 15].
While SMC techniques guarantee privacy and security by
using cryptographic algorithms, they are computationally prohibitive
in practice. On the other hand, secure transformation techniques
achieve privacy and security through data transformations, which
lead to faster algorithms. However, they are limited by
the privacy model used, for example k-anonymization [25], where
high levels of protection typically imply a great loss of accuracy
in the final results. Finally, hybrid techniques combine the previous
two strategies, leading to a good trade-off between privacy and utility.
As a drawback, many of these approaches rely on heuristics to
reduce the computational cost, and they inherit the privacy model
limitations of the secure transformation techniques.

In this paper we present a new secure data transformation method 
based on an embedding technique. In addition, we provide formal 
guarantees for the individual user privacy by developing a secure 
mechanism under the framework of differential privacy [8]. Our 
approach presents interesting new features such as the use of differ- 
ential privacy techniques in constructing a base for embedding the 
original data that gives a better representation of the data with re- 
spect to a random base. As a proof-of-concept, we focus on string 
records in this paper, and perform approximate matching of the 
records based on a similarity criterion. 



Our contributions: 

• We propose a novel embedding strategy based on frequent vari- 
able length grams to map string records into vectors in the real 
space. We show that the use of frequent variable length grams 
substantially increases the utility of the results with respect to 
random strings. 

• The proposed private record linkage protocol satisfies the
differential privacy framework, which provides formal guarantees of
individual privacy. In addition, this allows our strategy to offer
a trade-off between privacy level and utility that the user can
choose by changing the privacy parameter $\epsilon$.

• Our strategy allows approximate matching, and in this framework
we introduce the concept of a personalized threshold for
matching the strings embedded in the new space. Contrary to
matching approaches that use a global threshold, our approach
better captures the characteristics of each string by using a
personalized threshold for each record. Indeed, a global threshold does
not take advantage of the similarity of particular records. In
addition, a global threshold mechanism generally requires prior
knowledge to determine the optimal threshold. On the other
hand, our strategy automatically computes a personalized
threshold for each string by exploiting the data characteristics.

• Finally, we present a set of empirical experiments using real 
world datasets showing the benefit of our approach. 

The rest of the paper is organized as follows. Section 2 reviews
the current state of the art for the privacy preserving record linkage
problem. In Section 3 we introduce some basic definitions and
the privacy model adopted in our solution. Section 4 provides an
overview of our approach, while in Section 5 and Section 6 we
describe the major components of our schema. The experimental
results are reported in Section 7. Finally, we conclude the paper in
Section 8.



2. RELATED WORKS 

In this section, we present an overview of the techniques for
privacy preserving record linkage. We describe in detail the secure
mapping mechanism proposed by Scannapieco et al. [23], which is
the closest technique to our approach.

Private Record Linkage Techniques 

There is a broad variety of strategies proposed by the scientific 
community to tackle the private record linkage problem, which dif- 
fer in privacy notions, protocol models, and the type of objects re- 
quired to be matched. In the following, we distinguish these tech- 
niques in three major categories: secure transformations, Secure 
Multiparty Computation, and hybrid methods. 

Secure transformation techniques aim to perform the linkage
of the records after some transformations have been applied to the
original data. The typical scenario involves three parties, where
two parties hold the data and, using secure transformation
techniques, send the transformed data to a third party whose task is to
perform the matching. In this framework, two major strategies have been
proposed: hashing and embedding. Approaches based on hashing
functions try to match strings by hashing the original data and
computing similarity measures after the hash functions are applied.
Although these techniques are quite popular, they do not provide
a formal bound on the distance in the hashed space. In addition,
they are subject to dictionary attacks, and if the hash function is
disclosed the entire security and privacy are compromised. Examples
of hashing techniques are Bloom filters [24], q-gram hashing [7],
and TFIDF hashing [1].

A more recent example of the embedding strategy for record
linkage is the approach proposed by Scannapieco et al. [23], which
uses SparseMap [13] to embed strings into a vector space and
performs matching in this new space. The core of this approach relies
on the Lipschitz embedding [4, 16], which consists in projecting
each original data point $s$ into a new space using a base $B$ of
subsets $A_i$, $B = \{A_1, A_2, \ldots, A_k\}$, so that each point $s$ is mapped
via the embedding function $p$ into a vector $\mathbf{s} \in \mathbb{R}^k$, where each
coordinate $s_i = \min_{x \in A_i} \{ d_{Edit}(x, s) \}$, for $i = 1, 2, \ldots, k$. The
distance between vectors in the embedded space is measured using
the Euclidean distance $d'$, while in the original space the distance
metric is the Edit distance $d_{Edit}$. The general approach guarantees
privacy by showing that no information about the datasets is
disclosed, since only a base of random strings is shared between the
parties. However, to improve the performance of this approach the
authors in [23] proposed several heuristics that aim to better select
the random strings in the base. Although these techniques improve
the performance, they may disclose sensitive information. Indeed,
as the protocol is designed, the shared base is optimized according
to the data of one party. Then, if the other party is malicious, it
may potentially break the privacy by inferring the original data from
the structure of the base.
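The Lipschitz embedding described here can be illustrated with a minimal Python sketch. The base and the strings are made-up examples, and the helper names `edit_distance` and `lipschitz_embed` are hypothetical; the sketch does not include the SparseMap heuristics of [23].

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,                 # delete cx
                           cur[j - 1] + 1,              # insert cy
                           prev[j - 1] + (cx != cy)))   # replace, or free match
        prev = cur
    return prev[-1]


def lipschitz_embed(s: str, base) -> list:
    """Coordinate i is the minimum Edit distance from s to any string in A_i."""
    return [min(edit_distance(x, s) for x in A_i) for A_i in base]


# Hypothetical base B = {A_1, A_2, A_3} of reference sets.
base = [{"MARY", "MARIA"}, {"ALAN"}, {"EMMA", "ANNA"}]
u = lipschitz_embed("MARIE", base)
v = lipschitz_embed("MARIA", base)
print(u, v, dist(u, v))  # vectors in R^3 and their Euclidean distance d'
```

Each coordinate is 1-Lipschitz with respect to the Edit distance, so two strings within a few edit operations of each other cannot be far apart on any single coordinate.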

A key feature of embedding approaches is that the distance in the
original space can be related to the distance in the new
space through the distortion induced by the embedding map.
Unfortunately, general embedding functions are computationally
expensive to apply and some heuristics are needed. Moreover,
bounding the distortion induced by an embedding may be technically
challenging. For this reason, in this paper we present an embedding
technique based on the formal model of differential privacy.

Secure Multiparty Computation (SMC) techniques cast the
record linkage problem into a secure communication framework.
In this scenario, several parties are involved in the protocol and
the communication is protected using cryptography. The key idea is that
the computation itself should reveal no more than whatever may be
revealed by examining the input and output of each party. An
important theoretical result in cryptography [30] shows that
any computable function can be computed in this setting.
Motivated by this, several works have been proposed in the literature.
For example, when exact matching is considered, the record
linkage problem can be interpreted as a set intersection problem [31].
The work in [19] investigates the SMC approach in
privacy preserving data mining. While in principle the private record
linkage problem can be solved using SMC and cryptography, the
computational and communication costs of these methods turn out
to be too high in real applications.

Hybrid methods combine anonymization or secure transformation
techniques with SMC techniques with the aim of reducing
the SMC cost. Inan et al. [14] proposed a composed strategy based on
SMC and sanitization to achieve a trade-off between privacy and
utility. This work has been further extended in [15] by differentially
private blocking followed by SMC techniques for matching record
pairs in matched blocks instead of matching all record pairs. Within
the framework of blocking for record linkage, the work in [11]
applies machine learning techniques to define the blocking
functions, showing interesting results in terms of utility. Although hybrid
techniques provide a good trade-off between privacy and accuracy, the SMC
step still involves high computational cost and the impact of the
blocking on the linkage accuracy is not clearly understood.

3. PRELIMINARIES 

In this section, we introduce some notations and definitions
related to our approach. First, we briefly present notions concerning
string records and the concept of embedding. Second, we
review the model of differential privacy, which is used as our privacy
model. An overview of the frequent symbols used in the paper is
given in Table 1.

3.1 Basic Definitions 

Let $\Sigma$ be a finite alphabet. We denote by $x = x_0 x_1 \cdots x_{n-1}$ a
string of length $n$, where each symbol $x_i$ belongs to $\Sigma$. Moreover,
we denote by $|x|$ the length of the string $x$. Given a string $x$ of
length $n$, we represent the substring from position $i$ to $j$ in $x$ with
$x[i, j] = x_i x_{i+1} \cdots x_j$, where $0 \le i \le j \le n - 1$. A common
similarity measure between strings is the Edit distance, also known as
the Levenshtein distance, which measures the number of edit
operations needed to transform one string into the other.

Definition 1 (Edit Distance [17]). The Edit distance
between two strings $x$ and $y$ is defined as the minimum number of
character edit operations necessary to transform $x$ into $y$. A single
character edit operation can either replace, delete or insert a
character in $x$ or in $y$. We denote the Edit distance between $x$ and $y$ by
$d_{Edit}(x, y)$.

Using this definition of the similarity metric, we can denote by
$(\Sigma^*, d_{Edit})$ the metric space formed by the pair: space of all
possible strings, and Edit distance.
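As an illustration of Definition 1, the following Python sketch computes the Edit distance by recursing over prefixes of the two strings, with one branch per edit operation. It is a textbook formulation given for exposition only, and the function name `edit_distance` is a hypothetical choice.

```python
from functools import lru_cache


def edit_distance(x: str, y: str) -> int:
    """Minimum number of replace, delete and insert operations turning x into y."""

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # d(i, j) is the distance between the prefixes x[:i] and y[:j].
        if i == 0:
            return j  # insert the j remaining symbols of y
        if j == 0:
            return i  # delete the i remaining symbols of x
        return min(
            d(i - 1, j) + 1,                            # delete x[i-1]
            d(i, j - 1) + 1,                            # insert y[j-1]
            d(i - 1, j - 1) + (x[i - 1] != y[j - 1]),   # replace, or free match
        )

    return d(len(x), len(y))


print(edit_distance("MARIA", "MARY"))  # replace I with Y, then delete the final A
```

The function is symmetric, returns zero only for identical strings, and satisfies the triangle inequality, which is what makes the pair of string space and Edit distance a metric space.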

In this scenario, we informally introduce the notion of embedding
used in the paper as the map $p : \Sigma^* \to \mathbb{R}^k$, where $k$ is the
dimensionality of the embedded space and the distance in the
new space is the Euclidean distance.

3.2 Differential Privacy 

Differential privacy [8] is a recent notion of privacy that aims
to limit the disclosure of information when statistical data are
released. A differentially private mechanism guarantees that the
output of a randomized algorithm is insensitive to
changes in any particular individual record of the input data.



Table 1: Table of frequent symbols

Symbol                          Description
$D_A$, $D_B$                    Databases for party A and party B
$\epsilon$                      Privacy parameter
$k$                             Size of the base
$N$                             Size of the databases
$B$                             Shared base for the embedding
$x, y, z$                       Strings
$\mathbf{x}, \mathbf{y}, \mathbf{z}$   Vectors
$\alpha, \beta, \omega$         Substrings
$|\cdot|$                       Length for strings, substrings, grams
$d_{Edit}(\cdot)$               Edit distance
$d'(\cdot)$                     Distance in the embedded space
$ed$                            Threshold in terms of Edit operations
$th$                            Threshold in terms of $d'$



Definition 2 (Differential Privacy [9]). A non-interactive
privacy mechanism $M$ has $\epsilon$-differential privacy if for any
two input sets (databases) $D_A$ and $D_B$ with symmetric difference
one (neighbor databases), and for any set of outcomes $S \subseteq Range(M)$,

$$\Pr[M(D_A) \in S] \le \exp(\epsilon) \times \Pr[M(D_B) \in S], \qquad (1)$$

where $\epsilon$ is the privacy parameter (also referred to as privacy
budget). Intuitively, a lower value of $\epsilon$ implies stronger privacy
guarantees, and vice versa.

This mechanism has two important properties that are exten- 
sively used when differential privacy computations are combined. 
These two properties are known as sequential and parallel compo- 
sitions [21]. The former states that any sequence of computations 
that each provides differential privacy in isolation also provides dif- 
ferential privacy in sequence. 



Theorem 1 (Sequential Composition [21]). Let $M_i$ be
a non-interactive privacy mechanism which provides $\epsilon_i$-differential
privacy. Then, a sequence of $M_i(D)$ over the database $D$ provides
$(\sum_i \epsilon_i)$-differential privacy.

The latter instead holds when the computations involved are dis- 
joint. In this case, the privacy cost does not accumulate but depends 
only on the worst guarantee. 

Theorem 2 (Parallel Composition [21]). Let $M_i$ be a
non-interactive privacy mechanism which provides $\epsilon_i$-differential
privacy. Then, a sequence of disjoint $M_i(D)$ over the database $D$
provides $(\max_i \epsilon_i)$-differential privacy.

In the literature, there are two major techniques to achieve
differential privacy: Laplace noise [9] and the Exponential Mechanism [20].
Both of these strategies are based on the concept of global sensitivity
[9] of the function to compute.

Definition 3 (Global Sensitivity [9]). For any two neighbor
databases $D_A$ and $D_B$, the global sensitivity for any function
$f : D \to \mathbb{R}^n$ is defined as:

$$GS(f) := \max_{D_A, D_B} \| f(D_A) - f(D_B) \|_1. \qquad (2)$$



In the rest of the paper we restrict our attention to counting
queries, which can be proved to have global sensitivity $GS(count) = 1$.
The Laplace mechanism is used in the following sections of
the paper to construct differentially private algorithms, so
we briefly discuss it here. Let $f$ be a function and $\epsilon$ be
the privacy parameter; by adding noise to the result $f(D)$
we obtain a differentially private mechanism. The noise is
generated from a Laplace distribution with probability density function
$pdf(x \mid \lambda) = \frac{1}{2\lambda} e^{-|x|/\lambda}$, where the parameter $\lambda$ is determined by $\epsilon$
and $GS(f)$.
[Figure 1: Overview of the Secure Protocol. Recoverable panel text for
party C: receives $V_B$ from B; (5.2) uses $Th$ to match the vectors in
$V_A$ with those in $V_B$; (5.3) produces the set of neighbour vectors $N$.]

Theorem 3 (Laplace Mechanism [9]). For any function
$f : D \to \mathbb{R}^n$, the mechanism $M(D)$ that returns:

$$M(D) = f(D) + Lap(GS(f)/\epsilon), \qquad (3)$$

guarantees $\epsilon$-differential privacy.
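A minimal Python sketch of the Laplace mechanism applied to a counting query is given here for illustration. The function names and the toy records are hypothetical, and the noise is drawn as the difference of two exponential variates, which is one standard way to sample a Laplace distribution.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Lap(scale): the difference of two Exp(1) draws, rescaled."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def laplace_mechanism(true_value: float, sensitivity: float,
                      epsilon: float, rng: random.Random) -> float:
    """Return f(D) + Lap(GS(f) / epsilon) for a scalar query result."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_value + laplace_noise(sensitivity / epsilon, rng)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Counting query with global sensitivity 1, released with Laplace noise."""
    true_count = sum(1 for r in records if predicate(r))
    return laplace_mechanism(true_count, 1.0, epsilon, rng)


rng = random.Random(42)  # fixed seed only to make the illustration repeatable
names = ["MARIA", "MARY", "MARK", "ALAN", "EMMA"]  # toy records
print(private_count(names, lambda s: s.startswith("MAR"), epsilon=1.0, rng=rng))
```

A smaller `epsilon` gives a larger noise scale, so the released count is more private and less accurate.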

4. OVERVIEW OF OUR SOLUTION 

In this section we introduce our matching mechanism, which relies
on the idea of matching the string records in a metric space
obtained by embedding the original datasets. We propose an embedding
technique based on grams, and a matching protocol to match
close records (i.e., within a fixed number of edit operations). The
base is produced by mining the original data with differentially
private mining algorithms.

Our first contribution is an embedding strategy which maps the
original data space into a vector space by projecting each string in
the databases on a base formed by a set of frequent grams. A gram
of length $q$ is a substring $x_0 x_1 \cdots x_{q-1}$ with $q$ symbols. Each party
starts to build a base for the embedding by mining grams from the
strings in its own database; this phase is denoted as the mining phase.
This process is performed with a guarantee of differential privacy,
so that the parties involved in the protocol can share their bases and
determine a common base for the embedding without disclosing
any sensitive information about individual records. Once a final base
is determined, each party embeds its data using the common base,
and the matching is performed in the embedded space. We denote
this step as the embedding phase. Our overall protocol is illustrated in
Figure 1. Our strategy requires the presence of a trusted third party,
denoted by C, whose task consists in matching the records in the
embedded space. A summary of the steps is reported as follows.

1. Mining Phase: Parties A and B apply a differentially private
algorithm to mine their respective databases $D_A$ and $D_B$, and
compute private bases $B_A$ and $B_B$.

2. Base Generation: One of the two parties is in charge of merging
the two bases and producing a shared base $B$ of frequent
variable length grams.

3. Embedding Phase: Each party A and B, by using the shared
base, embeds its own data and generates a set of vectors $V_A$ and
$V_B$ respectively, representing the strings in the original datasets.
These sets are sent to the third party C.

4. Threshold Generation: Given a maximum value of edit distance
$ed$ in the original space, the party A generates a set of
threshold values $Th = \{th_1, th_2, \ldots, th_N\}$ for the distance in
the embedding space, one for each string $s_i$, $i = 1, 2, \ldots, N$,
in its own dataset $D_A$, to use in the matching phase. Each $th_i$
is a personalized threshold for the string $s_i$ and it represents the
maximum distance between $\mathbf{s}_i$ and $\mathbf{z}$ such that the string $z$ is
close to $s_i$. This set of thresholds is sent to C.

5. Matching Phase: The third party C, for each vector $\mathbf{s}_i \in V_A$,
returns a set of neighbor vectors $\mathcal{N}(\mathbf{s}_i)$ computed as follows:

$$\mathcal{N}(\mathbf{s}_i) := \{ \mathbf{y} \in V_B \text{ s.t. } d'(\mathbf{s}_i, \mathbf{y}) \le th_i \}. \qquad (4)$$

Figure 2(a) illustrates the mining and base generation phases. The
parties A and B mine their respective datasets and produce a shared
private base formed by the following grams {A, M, MA, E , o}. This
shared base is used by each party to embed its own data and
produce the set of vectors as in Figure 2(b). In the following sections,
we describe each phase of our approach in detail.
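The matching rule applied by party C in the Matching Phase can be sketched in Python as follows. The vectors and thresholds are made-up values, the function name `match` is hypothetical, and the non-strict comparison against the threshold is an assumption about Equation (4).

```python
from math import dist  # Euclidean distance d' in the embedded space


def match(V_A, V_B, Th):
    """For each vector s_i in V_A, list the indices of the vectors y in V_B
    whose distance from s_i does not exceed the personalized threshold th_i."""
    if len(V_A) != len(Th):
        raise ValueError("one threshold is required for each vector of party A")
    return [[j for j, y in enumerate(V_B) if dist(s, y) <= th]
            for s, th in zip(V_A, Th)]


# Toy embedded data; real vectors would come from the shared base.
V_A = [[0.0, 0.0], [3.0, 4.0]]
V_B = [[0.0, 1.0], [3.0, 4.0], [10.0, 10.0]]
Th = [1.0, 0.5]  # a different threshold for each record of party A
print(match(V_A, V_B, Th))
```

With a global threshold every row would share the same value of `th`; here each record of party A carries its own.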

5. MINING PHASE 

Contrary to the SparseMap approach, we construct a base mined
from the original data for two main reasons. First, we take advantage
of the fact that in record linkage scenarios the strings
being matched usually have similar properties (e.g., same alphabet,
similar length, etc.), so by constructing a base from the original
dataset we can capture this information. Second, a randomly
generated base cannot represent every dataset well, since it is defined in
a generic way and is not data dependent.

Our idea consists in forming a base to embed the strings in the
original space by mining the frequent grams in the database.
Formally, given a positive integer $k$, a minimum length $q_{min}$ and a
maximum length $q_{max}$, we are interested in mining the top-$k$
frequent $q$-grams where $q \in [q_{min}, q_{max}]$, and in using this set to
construct the base for the embedding. Intuitively, in this way we can
obtain a base set that is a good representative of all the strings in
the databases. This idea has been successfully applied in the
approximate string matching framework [29, 18]. The problem of
mining sequential patterns poses several challenges, and the
literature is rich in efficient solutions to address this problem [12, 22].
However, a direct use of the mined base from a party in the
protocol may disclose information about the strings in its database. For
example, Atzori et al. [2] investigate the problem of disclosing
sensitive information in frequent itemset mining using blocking
anonymity. In our approach, we consider two approaches to
guarantee that the base produced from the mining phase satisfies the
differential privacy notion. In the following, we present two
techniques to mine the frequent grams. The first approach is an
adaptation of the score perturbation frequent pattern mining technique
proposed by R. Bhaskar et al. [3], while the second one is adapted
from the prefix tree mining algorithm introduced in [5].

[Figure 2: Example of Mining, Base generation and Embedding of the
data. Panel (a): Mining and Base Generation. Panel (b): Embedded Data,
listing the embedded vectors $V_A$ and $V_B$.]

The motivation for proposing two mining algorithms is that
the algorithm in [3] requires prior knowledge of $q_{min}$,
$q_{max}$ and $k$ to guarantee the differential privacy of the grams. Clearly,
if these values change or are unspecified, the algorithm has to use
additional privacy budget to adjust the released data. On the other
hand, the prefix tree miner approach proposed in this section does
not have this limitation. Indeed, the prefix tree needs to be built
only once and does not depend on these parameters (except for
the hybrid strategy). Once the prefix tree is produced, it can be
mined many times for frequent grams of different lengths and $k$.
Therefore, this approach is preferable when some of these
parameters are not specified and an exploration of the parameter values
is required. In this section we prove that our mining phase
produces a base of grams that guarantees $\epsilon$-differential privacy. When
the two databases may be correlated, because they may contain
records belonging to the same owner, the total privacy parameter is
split between the two parties holding the datasets, so that the overall
privacy level is $\epsilon$.

5.1 FPM Miner 

The score perturbation frequent pattern mining (FPM) proposed
in [3] requires as input the value $k$ and the length $l$ of the patterns
to mine. As output, it produces a list of the top-$k$ most frequent
patterns of length $l$ in the dataset, and this algorithm guarantees
differential privacy. The original approach first runs a non-private
mining algorithm to retrieve the list of frequent itemsets up to a given
length $l$, and then, given $k$, it computes the frequency $f_k$ of the
$k$-th most frequent item. Using this information, the frequencies
of the items are truncated, and later perturbed with Laplace noise.
After a sampling process the list of top-$k$ items is reported.

In order to mine only patterns instead of items, we run the FPM
by mining the frequent patterns with the non-private algorithm in
the first step. Then the perturbation and the truncation of the
frequencies are considered only on the universe of patterns. Since the
FPM mines fixed length patterns, given the overall privacy
parameter $\epsilon$, for our mining phase we can adapt this approach by performing
$\Delta_q = q_{max} - q_{min} + 1$ runs of the FPM algorithm, where in each
run the length of the pattern is $q$ for $q = q_{min}, q_{min} + 1, \ldots, q_{max}$,
the size of the set is $k$ and the privacy parameter is $\epsilon / \Delta_q$. In this
way, we mine the top-$k$ most frequent patterns for each length, and
finally we select the top-$k$ most frequent out of them.
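The way the FPM Miner splits the budget across pattern lengths can be sketched in Python. The per-length miner is passed in as a callable; the default `toy_fpm` is a hypothetical, NON-private stand-in that returns exact counts and ignores its budget, so the sketch shows only the orchestration and not the score perturbation algorithm of [3].

```python
from collections import Counter


def toy_fpm(strings, k, q, epsilon):
    """NON-PRIVATE placeholder for one FPM run: exact top-k q-gram counts.
    The epsilon argument is accepted but unused (assumption for illustration)."""
    counts = Counter(s[i:i + q] for s in strings for i in range(len(s) - q + 1))
    return counts.most_common(k)


def fpm_miner(strings, k, q_min, q_max, epsilon, fpm=toy_fpm):
    """Run one fixed-length miner per q in [q_min, q_max], each with budget
    epsilon / delta_q, then keep the k most frequent grams overall."""
    delta_q = q_max - q_min + 1
    eps_run = epsilon / delta_q          # budget of each individual run
    candidates = []
    for q in range(q_min, q_max + 1):
        candidates.extend(fpm(strings, k, q, eps_run))
    candidates.sort(key=lambda gram_freq: gram_freq[1], reverse=True)
    return candidates[:k]


print(fpm_miner(["MARIA", "MARY", "MARK"], k=3, q_min=1, q_max=2, epsilon=1.0))
```

Replacing `toy_fpm` with a differentially private fixed-length miner leaves the budget accounting unchanged: the runs each spend `epsilon / delta_q`, which adds up to `epsilon` under sequential composition.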

5.1.1 Privacy Analysis 

The privacy analysis follows directly from the application of the
sequential composition lemma and from the result in [3].

Theorem 4 (FPM Miner differential privacy). The FPM
Miner guarantees $\epsilon$-differential privacy.

PROOF. The FPM Miner performs $\Delta_q$ calls of the FPM algorithm, each of
which has been proven to be $\epsilon / \Delta_q$-differentially private in [3]. Therefore,
by the sequential composition lemma for differential privacy the
overall strategy guarantees $\epsilon$-differential privacy. □

5.1.2 Complexity Analysis 

The running time of the FPM Miner follows directly from the
time complexity of the FPM algorithm. The authors in [3] showed that
the complexity of their approach is related to the non-private
algorithm used to initially mine the items and compute their truncated
frequencies. By running a non-private miner that computes the
occurrences of all the grams of length $q$, this phase results in an
$O(k_q N)$ factor, where $k_q$ is the number of $q$-grams whose frequency
is greater than $f_k - \gamma$, and $\gamma$ is the utility parameter. From [3], each
run of FPM requires $O(k_q + k \log k_q + k_q N)$ time, hence the overall
complexity of the FPM Miner is $O(\Delta_q (k' + k \log k' + k' N))$,
where $k' = \max_{q \in [q_{min}, q_{max}]} k_q$.

5.2 Prefix-tree Miner 

In addition to the previous strategy, we adapt the mining
algorithm developed to mine sequential data in [5]. For our mining
task, we partition the space of all the possible grams using a
top-down approach, where each partition is identified by a node in a
prefix tree $T$. Each node has the following information: a prefix
$\omega$, an accumulated privacy budget and the subset of all the strings
in the original database having $\omega$ as a prefix, called the partition
represented by the node. Then the prefix tree is traversed and all
the frequent variable length grams are reported.

The construction of the prefix tree can be summarized as follows.
Starting from the root of the tree, labeled with the empty string and
representing the entire space, the database is partitioned by
extending the prefix of the current node using the strategy in Algorithm 1.
For every symbol $a$ in the alphabet $\Sigma$ a new node is added to
the tree if the string $\omega a$ is a frequent prefix in the dataset, where $\omega$
is the string formed by the concatenation of the symbols following
the path from the root to the parent of the current node. To
determine whether a prefix is frequent, in lines 5 to 8 of Algorithm 1 a
counting query is issued on the partition of the dataset represented by
the current node and the real count is perturbed by Laplace noise
to guarantee differential privacy. This noisy count is compared
against a threshold $\theta$ representing the noisy count for empty
partitions. In this process, only partitions with frequent prefixes are
further refined, and the partitioning stops when their frequencies
are not large enough or a specified depth of the tree is reached.
The allocation of the budget at each level in the tree is
performed at line 6 of Algorithm 1. In our approach, we propose
several strategies to allocate the privacy budget in the tree: linear
allocation, exponential allocation, adaptive, and hybrid. Details
about these strategies are reported at the end of this section.

An example of the prefix tree obtained from the dataset $D_A$ in
Figure 2(a) is reported in Figure 3. Each node represents a partition
of the strings in $D_A$, and is associated with the noisy count of the prefix
obtained by following the path from the root to the node. For
example, the prefix AL induces a partition formed by the strings with
IDs 6 and 7 in $D_A$, and the noisy count for this prefix is 5.

Algorithm 1 Private Prefix-Tree Partitioner

 1: procedure PPT_PART(node, T)
    Input: node node; private prefix-tree T
    Output: T private prefix-tree
 2:   while ((node.h < h_MAX) and (node.budget < ε)) do
 3:     for (every symbol a in the alphabet Σ) do
 4:       ω ← (path from r to node) + a
 5:       P ← {x ∈ node.set s.t. ω is a prefix of x}
 6:       ε ← ALLOCATE_BUDGET(node)
 7:       count ← |P| + Lap(1/ε)
 8:       if (count > θ) then                      ▷ Non empty node
 9:         Add a new node cur as child of node, such that:
10:           cur.set ← P, cur.epsilon ← ε
11:           cur.label ← ω
12:           cur.budget ← cur.budget + cur.epsilon
13:           cur.h ← node.h + 1
14:         PPT_PART(cur, T)                       ▷ Recursive call
15:       end if
16:     end for
17:   end while
18:   return T
19: end procedure
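A rough Python sketch of the partitioning idea behind Algorithm 1 follows. It is an illustration under several assumptions rather than the authors' implementation: the budget is allocated linearly, the outer loop of the pseudocode is rendered as a single guard so that each node is expanded once, the accumulated budget of a child is taken as its parent's budget plus its own share, and the names `Node`, `laplace` and `ppt_part` are hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                 # prefix omega represented by the node
    strings: list              # partition: strings having `label` as a prefix
    h: int = 0                 # depth of the node in the tree
    budget: float = 0.0        # privacy budget accumulated along the path
    count: float = 0.0         # noisy count of the prefix
    children: dict = field(default_factory=dict)


def laplace(scale: float, rng: random.Random) -> float:
    """Sample Lap(scale) as a rescaled difference of two Exp(1) variates."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def ppt_part(node, alphabet, epsilon, h_max, theta, rng):
    """Recursively extend `node` with every symbol whose noisy count exceeds theta."""
    if node.h >= h_max or node.budget >= epsilon - 1e-12:
        return node                                  # depth or budget exhausted
    eps_level = epsilon / h_max                      # linear allocation (assumed)
    for a in alphabet:
        omega = node.label + a
        part = [x for x in node.strings if x.startswith(omega)]
        noisy = len(part) + laplace(1.0 / eps_level, rng)   # counting query, GS = 1
        if noisy > theta:                            # treated as a non-empty partition
            child = Node(omega, part, node.h + 1, node.budget + eps_level, noisy)
            node.children[a] = child
            ppt_part(child, alphabet, epsilon, h_max, theta, rng)
    return node


rng = random.Random(7)  # fixed seed only for a repeatable illustration
root = Node("", ["ALAN", "ALEX", "MARIA", "MARY"])
ppt_part(root, alphabet="AELMNRIXY", epsilon=1.0, h_max=3, theta=1.0, rng=rng)
print(sorted(root.children))
```

Because the counts are noisy, an empty partition can occasionally pass the threshold and a small non-empty one can be pruned; the threshold and the budget allocation control how often this happens.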

After we partition the data space using the prefix tree, we
traverse the prefix tree and for each root-to-leaf path we apply the
following consistency constraints as in [5]:

Definition 4 (Consistency Constraints [5]). For any
node $v$ in the prefix tree, the following holds:

1. $count(v) \ge count(u)$, where $u$ is a child of $v$.

2. $count(v) \ge \sum_{u \text{ is a child of } v} count(u)$.

Once the consistency constraints are enforced, we report a list of
grams by traversing the tree. First of all, we can notice that the
frequency count for any gram $x$ is given by the sum of the frequencies
of the prefixes that have $x$ as a suffix. Therefore, for any frequent
prefix $\omega$ in the tree we increase the frequency of the substring $x$ of
length $q \in [q_{min}, q_{max}]$ that is a suffix of $\omega = \alpha x$ by the count of
$\omega$. To speed up this step, we can adopt a more efficient construction
of the prefix tree that uses prefix links, which can be used to efficiently
traverse the tree and extract the grams [6]. As a final result, from
this list we return the top-$k$ grams sorted by their noisy frequencies.
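The gram extraction step can be sketched in Python as follows. The prefix counts are passed as a plain dictionary, standing in for a traversal of the tree, and the example uses exact rather than noisy counts; the function name `grams_from_prefixes` is hypothetical and the prefix-link optimization of [6] is not shown.

```python
from collections import defaultdict


def grams_from_prefixes(prefix_counts, q_min, q_max, k):
    """Add the count of every prefix to each of its suffixes of length q in
    [q_min, q_max], then return the k grams with the largest totals."""
    if q_min < 1:
        raise ValueError("q_min must be at least 1")
    freq = defaultdict(float)
    for omega, count in prefix_counts.items():
        for q in range(q_min, min(q_max, len(omega)) + 1):
            freq[omega[-q:]] += count        # suffix x of omega with |x| = q
    # Sort by decreasing frequency; ties are broken alphabetically.
    return sorted(freq.items(), key=lambda gram_freq: (-gram_freq[1], gram_freq[0]))[:k]


# Exact prefix counts for the toy database {"AB", "AB", "CAB"}.
prefix_counts = {"A": 2, "AB": 2, "C": 1, "CA": 1, "CAB": 1}
print(grams_from_prefixes(prefix_counts, q_min=1, q_max=2, k=3))
```

In the toy database the gram AB occurs three times, once in each string, and the same total is recovered by adding the counts of the prefixes AB and CAB.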

5.2.1 Budget Allocation Strategies

In this section, we describe the strategies used in our prefix tree
partitioning algorithm to allocate the budget in issuing the counting
query for each node. In the following we denote by $\epsilon$ the level of
privacy that we aim to guarantee with our tree and by $h_{MAX}$ the
maximum depth of the prefix tree.

• Linear: At each level in the tree we allocate the same amount of
budget $\epsilon_h = \epsilon / h_{MAX}$ for each node.

• Exponential: At level $i$ in the tree, a node can use an amount
of budget that is double the amount spent by its parent, $\epsilon_i =
2\epsilon_{i-1}$, $i = 1, \ldots, h_{MAX} - 1$. Since on the overall root-to-leaf
path we need to guarantee $\epsilon$-privacy, we start with
$\epsilon_0 = \epsilon / (2^{h_{MAX}+1} - 1)$.

• Adaptive: This strategy is an adaptation of the previous
exponential allocation schema, where the entire remaining budget on
the path is spent on the next counting query if the current node
represents a non-frequent prefix.

• Hybrid: This strategy, like the previous one, uses a threshold to
decide whether to spend all the remaining budget on the next counting
query; in addition, the total budget is distributed in the tree
according to the maximum gram length $q_{max}$. In particular, we reserve
half of the total budget for the nodes on the first $q_{max}$ levels of the
tree, where the budget of each node is proportional
to its level in the tree. For the nodes with level greater than
$q_{max}$, the remaining part is allocated in an exponential fashion.
For a node at level $i$ we use the budget $\epsilon_i$ defined as follows:



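$$\epsilon_i = \begin{cases} \dfrac{\epsilon \, i}{q_{max}(q_{max}+1)} & \text{if } 0 < i \le q_{max} \\[2ex] \dfrac{\epsilon \, 2^{\,i - q_{max} - 1}}{2\,(2^{h_{MAX} - q_{max}} - 1)} & \text{otherwise} \end{cases} \qquad (5)$$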



The intuition behind this strategy is to avoid premature
pruning in the tree. Indeed, with this approach we reserve half
of the total budget for the first $q_{max}$ levels of the tree, so that
frequent prefixes are likely to be extended to at least $q_{max}$ symbols.
This allows us to capture more occurrences of the grams present
in the frequent prefixes.
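As an illustration, the linear and exponential allocations and the adaptive rule can be sketched as below. This is a sketch under stated assumptions: a root-to-leaf path is taken to issue one counting query at each of the levels $0, \ldots, h_{MAX} - 1$, the starting budget of the exponential strategy is read as $\epsilon / (2^{h_{MAX}+1} - 1)$, the function names are hypothetical, and $h_{MAX} = 5$ is an arbitrary example value.

```python
def linear_budgets(eps, h_max):
    """Linear strategy: the same budget eps / h_max at each of the h_max levels."""
    return [eps / h_max] * h_max

def exponential_budgets(eps, h_max):
    """Exponential strategy: the budget doubles at each level,
    starting from eps / (2^(h_max + 1) - 1)."""
    eps0 = eps / (2 ** (h_max + 1) - 1)
    return [eps0 * 2 ** i for i in range(h_max)]

def adaptive_next_budget(planned, spent, eps, is_frequent):
    """Adaptive rule: keep the planned budget while the prefix is frequent,
    otherwise spend everything that is left on the path."""
    remaining = eps - spent
    return planned if is_frequent else remaining

eps, h_max = 0.1, 5          # example values (h_max is an assumption)
lin = linear_budgets(eps, h_max)
expo = exponential_budgets(eps, h_max)
# Budget for the query at level 2 when the node at level 1 is not frequent.
last = adaptive_next_budget(expo[2], sum(expo[:2]), eps, is_frequent=False)
```

Under these assumptions the budgets along any path sum to at most $\epsilon$, which is the property used in the privacy analysis.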

5.2.2 Privacy Analysis 

What is left to show is that the set of grams reported satisfies
differential privacy. Notice that all the partitions produced
by Algorithm 1 on the same level of the prefix tree are disjoint,
since they correspond to strings with different prefixes. Hence, by
the parallel composition lemma for differential privacy, the overall
privacy level is determined by the maximum privacy budget
spent over all the paths from the root to the leaves of the tree.
Given the sequence of nodes on such a path, the overall privacy level is
given by the sequential composition lemma and can be computed
as the sum of the privacy parameters used for the
counting queries at each node.

Figure 3: Prefix tree example from $D_A$ in Figure 2(a). Each node has a noisy count and the set of identifiers for the strings in the partition. Each
branch of the tree is labeled with the symbol used to extend the prefix.

Theorem 5 (PrefixTree differential privacy). The prefix
tree miner guarantees $\epsilon$-differential privacy.

PROOF. The proof follows directly from the differential privacy
result in [5], since it is easy to see that all the allocation strategies
use at most $\epsilon$ budget on any root-to-leaf path in the tree. □
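As a worked check of the budget bound used in the proof, consider a path with counting queries at levels $i = 0, \ldots, h_{MAX} - 1$ (this level indexing is our assumption) and the starting budget $\epsilon_0 = \epsilon / (2^{h_{MAX}+1} - 1)$ for the exponential strategy. Combining parallel composition across the disjoint partitions with sequential composition along a path gives:

```latex
\epsilon_{total} \;=\; \max_{p \,\in\, paths(T)} \; \sum_{v \in p} \epsilon_v ,
\qquad
\underbrace{\sum_{i=0}^{h_{MAX}-1} \frac{\epsilon}{h_{MAX}} = \epsilon}_{\text{linear}} ,
\qquad
\underbrace{\sum_{i=0}^{h_{MAX}-1} 2^{i}\,\epsilon_0
  = \left(2^{h_{MAX}} - 1\right)\epsilon_0
  = \frac{2^{h_{MAX}} - 1}{2^{h_{MAX}+1} - 1}\,\epsilon \;\le\; \epsilon}_{\text{exponential}}
```

The adaptive rule only ever spends the budget that remains on the path, so it stays within the same bound.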

5.2.3 Complexity Analysis 

Algorithm 1 has running time proportional to the number of
nodes in the prefix tree $T$. First, notice that on level
$i$ we have at most all the possible prefixes of length $i$ defined
on the alphabet $\Sigma$. Hence, the total number of nodes at level $i$
is bounded by $O(|\Sigma|^i)$. The actual number can be smaller, because
non-frequent prefixes are not extended further. Moreover,
each node performs a counting query on the partition of its
parent, which in the worst case requires a linear scan. Therefore each
level $i$ requires $O(N|\Sigma|^i)$ operations, where $N$ is the size of the
input database. Since the tree has at most $h_{MAX}$ levels,
the overall running time of Algorithm 1 is $O(N|\Sigma|^{h_{MAX}+1})$. In
our implementation, to speed up the space partitioning, we process
the children of an internal node in lexicographic order of
their labels, so that each counting query for a child is issued only on
the positions that contain no occurrence of its previous siblings.
This can be done efficiently by storing, for each node,
a sorted list of the positions where the prefix represented by the
node occurs. Moreover, the list is kept sorted at each partition
step at no additional computational cost. After the tree
is constructed, enforcing the consistency constraints requires $O(h_{MAX} N)$
operations, as shown in [5]. Finally, the running time for traversing
the prefix tree is linear in the number of internal nodes in the tree,
that is, $O(N|\Sigma|^{h_{MAX}+1})$. Therefore, the overall prefix tree mining
algorithm runs in $O(N|\Sigma|^{h_{MAX}+1})$.

6. EMBEDDING PHASE 

In this section we describe the embedding procedure that maps
strings into vectors. Let $B = \{g_1, g_2, \ldots, g_k\}$ be a base of $k$
grams. Each string $s$ in the database is mapped into a vector $\mathbf{s} \in \mathbb{R}^k$,
where each component $s_i$ represents the number of occurrences of
the gram $g_i$ in $s$, normalized by the length of $g_i$. Let $Occ_s(g)$
denote the set of positions in $s$ where the gram $g$ occurs,

$$Occ_s(g) := \{ i \in [0, |s| - |g|] : s[i, i + |g| - 1] = g \}$$

Each coordinate of the vector representing $s$ is defined as
$s_i = |Occ_s(g_i)| / |g_i|$, for $i = 1, \ldots, k$. In this new space the distance
between vectors is computed using the Euclidean distance as follows:

$$d'(x, y) = \|x - y\|_2 = \left( \sum_{i=1}^{k} (x_i - y_i)^2 \right)^{1/2} \qquad (6)$$

This new distance measure can be interpreted as a weighted version
of the $q$-gram distance proposed by Ukkonen [26], where the
contribution of each gram is reduced by a factor equal to the length
of the gram, and where the grams considered are a subset of the
grams shared between the strings in the original space. Intuitively,
with this mapping procedure, strings that are close in the original
space remain close in the vector space.
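The embedding and the distance of Equation (6) can be sketched as follows. This is a minimal illustration with hypothetical function names and a hypothetical base of three grams; it follows the definitions above but is not the authors' implementation.

```python
import math

def occurrences(s, g):
    """Positions i in [0, |s| - |g|] where the gram g occurs in s (overlaps allowed)."""
    return [i for i in range(len(s) - len(g) + 1) if s[i:i + len(g)] == g]

def embed(s, base):
    """Map s to the vector whose i-th coordinate is |Occ_s(g_i)| / |g_i|."""
    return [len(occurrences(s, g)) / len(g) for g in base]

def distance(x, y):
    """Euclidean distance between two embedded vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

base = ["a", "an", "na"]      # hypothetical base of k = 3 grams
u = embed("banana", base)     # [3.0, 1.0, 1.0]
v = embed("bandana", base)    # [3.0, 1.0, 0.5]
d = distance(u, v)            # 0.5
```

Here the two strings are one edit operation apart (one inserted symbol), and only the coordinate of the gram "na" changes.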

6.1 Threshold Computation 

After all the records are embedded into the new space, the
approximate matching between the records is performed. In particular,
we are interested in matching strings that are within $ed$ edit
operations in the original space. In the new space, this task is cast
as the problem of finding all the vectors whose Euclidean distance
is within a threshold value $th$. This threshold plays a central
role in the overall performance of the matching schema, since it
determines how many vectors become candidate matches.
Therefore, it is crucial to compute a threshold
value that is as tight as possible. This problem is generally
very hard, since proving formal guarantees requires an analysis
of the distance distortion and of the properties of the embedding strategy
used. For example, for the Lipschitz embedding it can be shown
that the distortion of the mapping is $O(\log n)$, where $n$ is the size
of the original data space [13, 4].

Global threshold: This strategy aims to define a global threshold
value that can be used in matching all the records in the new
space. This can be done by estimating how the original distance is
distorted after the embedding map is applied. In this direction, we
propose the following upper bound for our embedding.

Proposition 1 (Upper bound). Given $x$ and $y$, two strings
in the original space with edit distance $d_{Edit}(x, y) \le ed$, an upper
bound on $d'(x, y)$ is given by $\Delta_q \cdot ed$, where $\Delta_q = q_{max} - q_{min} + 1$.

PROOF. For a fixed gram length $q$, and for any value $ed$
of the edit distance between $x$ and $y$, the number of grams of length
$q$ that can contribute is at most $q \cdot ed$ (since at most $q$ grams
of length $q$ overlap any position $i$). Since the base $B$
used in the embedding is a subset of all possible $q$-grams with
$q \in [q_{min}, q_{max}]$, it follows that:

$$d'(x, y) = \|x - y\|_2 \le \Delta_q \cdot ed \qquad (7)$$

□

This upper bound is not very tight, and does not take into account
the fact that each string shares a different number of grams with the
base used for the embedding. For example, referring to the scenario
in Figure 2, if $ed = 1$ the upper bound for a global threshold
in matching the embedded data is $(2 - 1 + 1) \cdot 1 = 2$.

Algorithm 2 Dynamic Algorithm for computing th

procedure PersonalizedTh(s, D[·], P[·], ed)
    Input: String s; D[·], P[·], and edit distance ed.
    Output: Threshold value th
    for (i = 0, 1, ..., ed) do
        for (j = 0, 1, ..., |s| - 1) do
            T[i][j] = max{T[i][j - 1], T[i - 1][P[j]] + D[j]}
        end for
    end for
    return th = √(T[ed][j])
end procedure

Personalized threshold: In contrast, in our approach we define a
personalized threshold for each string that has to be matched. This
choice is motivated by the following reasons. First, each string
shares a different number of grams with the base, and for those
strings that share a large number of grams a personalized
threshold can better represent the original distance. Second, by
computing a threshold for each individual string,
we avoid the problem of estimating a single threshold suitable for
all the strings to be matched. To compute this personalized
threshold, we use a dynamic programming algorithm which,
for each string, computes the maximum distance in the embedded
space at which a vector representing a string within $ed$ edit operations
can be found. In particular, we adopt the algorithm proposed in
[29], where in our case we keep track of the impact of each gram
on the final distance rather than counting only its occurrences.

Given a string $s$, we start by initializing two data structures as follows:



Table 2: Experimental Datasets Statistics 



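$$D[i] = \sum_{g \in B} \delta_s(i, g), \quad i = 0, 1, \ldots, |s| - 1, \qquad \delta_s(i, g) = \begin{cases} 1 & \text{if } g \text{ overlaps } s \text{ at position } i \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$

$$P[i] = \max\{ j < i - |g| - 1 \text{ where } j \in Occ_s(g) \} \qquad (9)$$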



For each position $i$ in $s$, the entry $D[i]$ represents the total contribution
that the grams overlapping position $i$ may have on the coordinates
of $s$ if an edit operation occurs at position $i$. On the other hand,
$P[i]$ denotes the nearest position $j < i$ where a gram of the base
occurs in $s$. Using these structures, and given a maximum number $ed$
of allowed edit operations, we are interested in computing for
each string $s$ the maximum allowable distance $th$ in the embedding
space. Given $i$ edit operations, the dynamic programming strategy
considers two cases at each position $j$. First, if no edit operation occurs at
position $j$, the contribution remains the same as at the previous
position, $T[i][j - 1]$. Second, if an edit operation occurs at position $j$,
we need to add the contribution $D[j]$ of the grams at the current
position to the contribution of having $i - 1$ edit operations in the
positions before $j$. The overall strategy is reported in Algorithm 2,
and the running time of this dynamic programming algorithm is
$O(ed \cdot |s|)$.
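The recurrence of Algorithm 2 can be sketched as follows, taking the arrays $D$ and $P$ as precomputed inputs. Several details are our assumptions rather than part of the original algorithm: out-of-range table entries and the row for zero edits are treated as 0, $P[j] = -1$ encodes that no earlier position is available, and the returned threshold is the square root of the last table entry. The input values are hypothetical.

```python
import math

def personalized_threshold(D, P, ed):
    """Evaluate T[i][j] = max(T[i][j-1], T[i-1][P[j]] + D[j]) and
    return sqrt(T[ed][n-1]), where n = len(D)."""
    n = len(D)
    if n == 0:
        return 0.0
    # T[i][j]: largest total contribution with at most i edits in positions 0..j.
    # Row 0 stays at zero: without edits there is no contribution (assumption).
    T = [[0.0] * n for _ in range(ed + 1)]
    for i in range(1, ed + 1):
        for j in range(n):
            no_edit = T[i][j - 1] if j > 0 else 0.0         # no edit at position j
            before = T[i - 1][P[j]] if P[j] >= 0 else 0.0   # i-1 edits before j
            T[i][j] = max(no_edit, before + D[j])           # best of the two cases
    return math.sqrt(T[ed][n - 1])

# Hypothetical contributions and predecessor positions for a string of length 5.
D = [1.0, 2.0, 0.5, 2.0, 1.0]
P = [-1, -1, 0, 1, 2]
th1 = personalized_threshold(D, P, ed=1)   # sqrt(2): one edit at position 1 or 3
th2 = personalized_threshold(D, P, ed=2)   # 2.0: edits at positions 1 and 3
```

The table has $(ed + 1) \cdot |s|$ entries, each filled in constant time, which matches the stated $O(ed \cdot |s|)$ running time.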

6.2 Protocol Complexity Analysis 

In this section we analyze the complexity of our protocol in
terms of time and communication.

The mining phase in our strategy requires $O(k \cdot N)$ time for the FPM
miner and $O(N|\Sigma|^{h_{MAX}+1})$ for the prefix-tree-based approach.
After the base $B$ is computed, the databases are mapped into the vector
space in the embedding phase. Let $l$ be the maximum length
of the strings in the databases; then this step requires $O(lkN)$
operations, since each gram can have at most $l$ occurrences in each
string. Once the strings are transformed into vectors, each pair of
vectors can be matched using the Euclidean distance in $O(k)$ time.

Table 2: Experimental Datasets Statistics

Dataset   N        $l_{max}$   $l_{min}$   $l_{avg}$
NAMES     150000   15          4           7
CITIES    5000     23          3           8

Table 3: Table of experiment and algorithm parameters

Symbol        Description                 Default values
$\epsilon$    Privacy parameter           0.1
$k$           Size of the base            75
$q_{min}$     Min length of the grams     1
$q_{max}$     Max length of the grams     3
$ed$          Edit operations             {0,1,2}
$h_{MAX}$     Max depth of prefix tree
$\theta$      Threshold for noisy count   as in [5]

In addition to the complexity analysis reported above, we also
investigated the communication complexity of our protocol.
We are interested in providing an upper bound on the overall
number of bits transmitted between the parties. Without loss of
generality, we can assume that every string in the database requires
$O(\ln l)$ bits to be represented. Therefore, the shared base $B$ formed
by the mined grams is transmitted using $O(k \ln q_{max})$ bits. On
the other hand, the embedded records are vectors in $\mathbb{R}^k$, and
therefore they require $O(kN)$ bits to be represented. The overall
communication complexity of this protocol is $O(k(N + \ln q_{max}))$.

7. EXPERIMENTAL RESULTS 

In this section, we present experimental results obtained with our
private record linkage protocol. Our goal is to understand the impact
of the mining and embedding phases on the overall utility of
our protocol, as well as its dependency on the privacy parameter
$\epsilon$ and the dimensionality $k$. We compare our approach with the
matching schema of Scannapieco et al. [23] to show the improvement
in scalability, the benefit of a stronger privacy model, and the benefit
of using a personalized threshold in the matching schema, while achieving
comparable data utility.

The protocols were implemented in Java, and the simulations
were conducted on an Intel Core i5 2.5 GHz PC with 4 GB of RAM.
In the experiments, we use two real datasets, NAMES and CITIES.
The first contains a list of the most frequent surnames from the
2000 Census in the United States. The second is a list of the
5000 most populated cities in the United States in 2008. Some
statistics of the datasets are summarized in Table 2, where $l_{max}$,
$l_{min}$ and $l_{avg}$ are the maximal, minimal and average lengths of the
strings, respectively.

Before presenting the results, we describe the simulation scenario.
The datasets do not contain duplicated records,
and the simulation proceeds as follows. Party A holds a dataset
$D_A$, while party B holds a perturbed copy $D_B$ of $D_A$ in which each
string is corrupted by up to $ed$ edit operations. In this way we can
run the experiments for different numbers of edit operations. Below,
we first study the utility of the mined base, and then we examine the
utility of the overall linkage results. We use the $F_1$ score as the utility
metric [27], which combines the precision and recall in matching
the strings between the datasets.
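For reference, the $F_1$ score is the harmonic mean of precision and recall. A minimal sketch of how it can be computed for a linkage result is given below; representing matches as pairs of record identifiers and the function name `f1_score` are our assumptions, and the sample pairs are hypothetical.

```python
def f1_score(true_pairs, predicted_pairs):
    """F1 = 2 * precision * recall / (precision + recall) over matched record pairs."""
    true_pairs, predicted_pairs = set(true_pairs), set(predicted_pairs)
    tp = len(true_pairs & predicted_pairs)       # correctly reported matches
    if tp == 0:
        return 0.0
    precision = tp / len(predicted_pairs)        # fraction of reported pairs that are correct
    recall = tp / len(true_pairs)                # fraction of true pairs that are reported
    return 2 * precision * recall / (precision + recall)

# Hypothetical ground truth and linkage output as (id in D_A, id in D_B) pairs.
truth = {(1, 1), (2, 2), (3, 3), (4, 4)}
found = {(1, 1), (2, 2), (3, 5)}
score = f1_score(truth, found)                   # precision 2/3, recall 1/2
```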

The experiment and algorithm parameters, if not specified in the
description of the simulations, assume the default values reported
in Table 3.

NAMES is publicly available from the United States Census
(http://www.census.gov/genealogy/www/data/2000surnames/)

Figure 4: Utility for mining algorithms

7.1 Mining Utility 

In our first set of experiments, we examine the utility of the
mined frequent grams with respect to the privacy parameter $\epsilon$ and
the value $k$. Due to space limitations, for this set of experiments we
report only the results on the NAMES dataset. Similar results can
be observed on the CITIES dataset.

Figure 4 illustrates the utility of the results produced by the FPM
and prefix tree miner approaches with respect to different values
of the privacy level and of the number of grams mined. From Figure 4 we
can see that increasing the privacy parameter $\epsilon$ (i.e., decreasing the
privacy level) leads to better performance. Among the allocation
strategies proposed in Section 5.2.1, the hybrid strategy provides
significantly better utility for small $\epsilon$ values, but slightly worse
utility when $\epsilon > 0.2$. This is consistent with the fact that this strategy
was designed to reduce the early pruning in the tree caused by small
$\epsilon$ values. For the rest of the simulations on the linkage, we
run the prefix tree miner using the linear and hybrid allocation
strategies, since they provide the best utility for low and high levels
of privacy, respectively. On the right part of Figure 4 we illustrate
the dependency between the utility and the number of grams mined
($k$). As we can see, all the allocation strategies show the same
behavior: increasing the value of $k$ decreases the utility of the
mined base. This result is related to the fact that the prefix tree is
based on growing frequent prefixes, so when $k$ is large some
occurrences of the grams may not appear in the tree because
prefixes are not extended enough to capture them.
We can also observe that the hybrid strategy performs consistently
better than the other strategies.

The mining performance provided by the FPM-miner in our
simulations outperforms that of the prefix tree miner in both the
scenarios described above. Although the FPM-miner provides utility
results close to the optimal value, it suffers from the fact that the
values of $q_{min}$, $q_{max}$ and $k$ must be specified a priori. Indeed,
if the optimal value of these parameters is unknown, or one of
them changes, the FPM-miner must be run again, which
increases the total privacy budget spent. On the other hand, the prefix
tree miner with the linear, exponential and adaptive budget allocation
strategies does not have this limitation, since the construction of
the prefix tree does not depend on these parameters. Only traversing
the tree requires knowledge of $q_{min}$, $q_{max}$ and $k$, and it is done on
data that are already differentially private.

7.2 Linkage Utility 

In this part of the simulations, we illustrate the advantages of
a personalized threshold over a global one, the dependency of the
utility on the different parameters, and the overall performance
of our mechanism.
Figure 6: Impact of the base size, ed = 1

7.2.1 Global Vs Personalized Threshold 

First, we study the importance of using frequent grams over random
grams in the base. Figure 5 reports the utility in terms of $F_1$
score for different values of the global threshold ($Th$) in the embedded
space, with a direct comparison between random and frequent
grams. We tested the embedding strategy on the NAMES dataset,
allowing approximate matching with up to 2 edit operations,
with $k = 100$. In our scenario, two edit operations
represent a high level of perturbation of the original dataset, since
they correspond to changing up to roughly 35% of the symbols (on
average) in the strings. From the graph, it is evident that frequent
grams lead to a considerable improvement in utility with respect
to random grams (an improvement from 20% to 60%). This result
is explained by the fact that a base of frequent grams is more likely
to share a higher number of grams with the strings. Therefore, the
frequent base can better preserve the distance and better represent
the original data than a random base. In addition, we can
also observe that when approximate matching is allowed the utility
decreases as the number of edit operations increases. Indeed, the matching
becomes harder to perform since more grams are affected by the edit
operations.

Second, from Figure 5 we can observe that, when a global threshold
is used in the matching, there is an optimal point that maximizes the
utility. However, it is not easy to compute it a priori. For example,
by using the upper bound obtained in Section 6 we have that
$Th \le q_{max} \cdot ed = 6$, which in this case is too large to provide good
utility. For this reason, the rest of the results are obtained using
the personalized threshold approach.

7.2.2 Impact of the base size 

In Figure 6 we examine the impact of the size $k$ of the shared
base on the utility of the linkage results. In these simulations,
we consider an approximate matching scenario with $ed = 1$. The
experiments show that the utility increases with the dimensionality
until a maximum value is reached. We can observe that
our approach provides the best performance on both datasets
when the base contains between 50 and 75 grams. After
that, a further increase of the dimensionality degrades the utility
of the results. This phenomenon is peculiar, since a higher
dimensionality in the embedded space usually helps the utility.
However, it can be explained by the fact that a large base can better
preserve the distance in the space, but it also increases the number
of grams that we can potentially lose in case of edit operations.
Therefore, the threshold value increases and the performance
decreases, as shown in Figure 6.

7.2.3 Impact of the privacy parameter 

The relationship between the privacy parameter $\epsilon$ and the utility
of our protocol is reported in Figure 7 for the datasets considered.
In these graphs, we also compare the results provided by the private
miners with those of the non-private miner. As we can see, the utility of the
protocols that use private miners approaches the results obtained with
the non-private algorithm as $\epsilon$ increases. Moreover, in Figure 7 we
can see that just allowing one edit operation shifts the utility of the
matching results down by at most 18% for all the approaches
on the datasets considered. This result points out the hardness of
solving record linkage when approximate matching is allowed. In
the case of exact matching, the utility provided by the private mechanisms
is within 0.4% of the non-private one for every $\epsilon$ value
tested. When one edit is allowed, this gap slightly increases. This
behavior is more evident for the mining approach based on the prefix
tree. From this set of simulations, we can infer that the embedding
phase plays a crucial role in the overall performance of our
algorithm, since, as shown in Figure 7, the final utility does not
depend strongly on the utility of the mining phase.

7.2.4 Impact of edit operations 

In Figure 8 we present the impact of edit operations on the utility
of the protocol. These plots show that as the number of edit operations
increases, the overall utility decreases for all the strategies
tested. The edit operations considered in these simulations vary
from 0 (exact match) to 2 edits (35% of the average length of the strings
in the dataset). We can notice that the $F_1$ score decreases from 0.99
in the case of exact matching to 0.36 when 2 edits are allowed for
the NAMES dataset, while for the CITIES dataset it stops at 0.60.
This phenomenon is due to two important aspects. First, as we mentioned
earlier, when the edits increase, more grams can be lost
between the strings and the base formed by the frequent grams mined
in the first phase. Second, given a string $s$, the size of the set $N(s, ed)$
of possible matching strings within $ed$ edit operations increases
exponentially with $ed$; indeed $|N(s, ed)| \le |s|^{ed} |\Sigma|^{ed}$. Concerning
the performance, we can notice that the results reported by the
FPM-miner and the prefix tree hybrid approach are close to the utility
of the results provided by the non-private miner.

7.2.5 Protocol Performances 

To examine the scalability of our strategy, we performed several
experiments with different dataset sizes, and the results are reported
in Figure 9. The running time is measured in milliseconds (ms),
and it consists of the time needed to mine the base for each party, to
combine the bases to form the shared base, to embed the data, and
finally to compute the personalized threshold. In addition, we consider
only exact matching, since the Lipschitz mechanism in [23]
requires prior knowledge of the threshold value in the embedding
space to allow approximate matching. As we can see from Figure 9,
the running time of our protocol is linear in the size of the
dataset considered, which allows good scalability. These results
emphasize that on real data the algorithm performs better than
the worst-case bounds of the complexity analysis in Section 6.2
suggest. Figure 9 also shows the running time of
the Lipschitz approach with the heuristic proposed by Scannapieco et
al. in [23]. For this schema, we measure the time required to generate
the base using the heuristic and to produce the embedding map.
(b) Utility results with CITIES dataset

Figure 7: Impact of the privacy parameter

[Plots: Utility Vs Edit Operations (NAMES) and Utility Vs Edit Operations (CITIES); x-axis: Edit Operation (ed); series: non-private miner, PT linear, PT hybrid, FPM-miner]

Figure 8: Impact of edit: (Left) names, (Right) cities.

As we can see, this approach also scales linearly with the size of the
dataset, but our technique requires almost half of the running time
to achieve similar utility results. Indeed, for Lipschitz we choose
a base of size 12, which, as shown in Figure 10, maximizes
the utility and reduces the running time with respect to the
dimensionality.

In particular, for the Lipschitz approach we generate a candidate
set of random strings of length 10, and in our simulation we use
a candidate set of 1000 random strings ($O(\ln N)$), as suggested in
[23]. In Figure 10 we test the approaches with different base sizes
on the CITIES dataset. As we can see from the curves, Lipschitz
provides better utility for a smaller base, and our strategies achieve
similar results for $k > 20$. Although the secure protocol in [23]
provides a high level of accuracy, it requires a priori knowledge of the
threshold value, which is not always available. In addition, its
running time depends exponentially on the dimensionality,
while with our approach the dependency is only linear.

8. CONCLUSIONS 

In this paper, we presented a novel strategy
to perform privacy preserving record linkage for string records.
First, we tested and adapted the differential privacy mining techniques
presented in [3] and [5] to mine the frequent variable length
grams required by our embedding strategy. Second, we introduced
an embedding technique that performs a secure transformation
using a private base extracted from the original data. As the final
contribution, we used the concept of personalized threshold in matching
records, which allows us to compute a threshold value for each
record to match and does not require any a priori knowledge. Our
overall strategy presents comparable performance with the technique
proposed by Scannapieco et al. in [23], and fits well in the
secure data transformation framework by satisfying the strong privacy
model of differential privacy and presenting good scalability.

Figure 9: Running Time Vs N: (Left) frequent variable length grams
embedding, (Right) Lipschitz embedding.

Figure 10: Comparison between Lipschitz and frequent grams embedding

Future work will address more complex allocation strategies for
the prefix tree miner. For instance, we will consider how to take
advantage of the statistical properties of English words when issuing
counting queries. Moreover, we will investigate the possibility of
proving formal bounds on the utility of the results by assuming some
statistical distributions for the datasets.

9. ACKNOWLEDGMENT 

This material is based upon work supported by the National Science
Foundation under Grant No. 1117763.

10. REFERENCES 

[1] A. Al-Lawati, D. Lee, and P. McDaniel. Blocking-aware private record linkage. 
In Proceedings of the 2nd international workshop on Information quality in 
information systems, IQIS '05, pages 59-68, New York, NY, USA, 2005. ACM. 

[2] M. Atzori, F. Bonchi, F. Giannotti, and D. Pedreschi. Blocking anonymity 
threats raised by frequent itemset mining. In ICDM, pages 561-564, 2005. 

[3] R. Bhaskar, S. Laxman, A. Smith, and A. Thakurta. Discovering frequent 
patterns in sensitive data. In Proceedings of the 16th ACM SIGKDD 
international conference on Knowledge discovery and data mining, KDD '10, 
pages 503-512, New York, NY, USA, 2010. ACM. 

[4] J. Bourgain. On lipschitz embedding of finite metric spaces in hilbert space.
Israel Journal of Mathematics, 52:46-52, 1985. 10.1007/BF02776078.

[5] R. Chen, B. C. M. Fung, B. C. Desai, and N. M. Sossou. Differentially private
transit data publication: A case study on the montreal transportation system. In
Proc. of the 18th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (SIGKDD), Beijing, China, August 2012. ACM
Press.

[6] Y. Cheng and T. Zhang. Design of fast multiple string searching based on
improved prefix tree. In Knowledge Discovery and Data Mining, 2010. WKDD
'10. Third International Conference on, pages 111-114, Jan. 2010.

[7] T. Churches and P. Christen. Some methods for blindfolded record linkage.
BMC Medical Informatics and Decision Making, 4(1):9, 2004.

[8] C. Dwork. Differential privacy. In ICALP, pages 1-12. Springer, 2006.

[9] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to
sensitivity in private data analysis. In Theory of Cryptography, Third Theory of
Cryptography Conference, TCC 2006, pages 265-284, 2006.

[10] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record
detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1-16, 2007.



[11] L. O. Evangelista, E. Cortez, A. S. da Silva, and W. M. Jr. Adaptive and flexible
blocking for record linkage tasks. JIDM, 1(2):167-182, 2010.

[12] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate
generation. SIGMOD Rec., 29(2):1-12, May 2000.

[13] G. Hjaltason and H. Samet. Properties of embedding methods for similarity
searching in metric spaces. Pattern Analysis and Machine Intelligence, IEEE
Transactions on, 25(5):530-549, May 2003.

[14] A. Inan, M. Kantarcioglu, E. Bertino, and M. Scannapieco. A hybrid approach
to private record linkage. In Proceedings of the 2008 IEEE 24th International
Conference on Data Engineering, ICDE '08, pages 496-505, Washington, DC,
USA, 2008. IEEE Computer Society.

[15] A. Inan, M. Kantarcioglu, G. Ghinita, and E. Bertino. Private record matching
using differential privacy. In Proceedings of the 13th International Conference
on Extending Database Technology, EDBT '10, pages 123-134, New York, NY,
USA, 2010. ACM.

[16] W. B. Johnson and J. Lindenstrauss. Extensions of lipschitz mapping into
hilbert space. In Conf. in modern analysis and probability, volume 26 of
Contemporary Mathematics, pages 189-206. American Mathematical Society,
1984.

[17] V. Levenshtein. Binary Codes Capable of Correcting Deletions, Insertions and
Reversals. Soviet Physics Doklady, 10:707, 1966.

[18] C. Li, B. Wang, and X. Yang. Vgram: improving performance of approximate
queries on string collections using variable-length grams. In Proceedings of the
33rd international conference on Very large data bases, VLDB '07, pages
303-314. VLDB Endowment, 2007.

[19] Y. Lindell and B. Pinkas. Secure multiparty computation for privacy-preserving
data mining. Cryptology ePrint Archive, Report 2008/197, 2008.

[20] F. McSherry and K. Talwar. Mechanism design via differential privacy. In
Foundations of Computer Science, 2007. FOCS '07. 48th Annual IEEE
Symposium on, pages 94-103, Oct. 2007.

[21] F. D. McSherry. Privacy integrated queries: an extensible platform for
privacy-preserving data analysis. In Proceedings of the 35th SIGMOD
international conference on Management of data, SIGMOD '09, pages 19-30,
New York, NY, USA, 2009. ACM.

[22] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and
M.-C. Hsu. Mining sequential patterns by pattern-growth: the prefixspan
approach. Knowledge and Data Engineering, IEEE Transactions on,
16(11):1424-1440, Nov. 2004.

[23] M. Scannapieco, I. Figotin, E. Bertino, and A. K. Elmagarmid. Privacy
preserving schema and data matching. In Proceedings of the 2007 ACM
SIGMOD international conference on Management of data, SIGMOD '07,
pages 653-664, New York, NY, USA, 2007. ACM.

[24] R. Schnell, T. Bachteler, and J. Reiher. Privacy-preserving record linkage using
bloom filters. BMC Medical Informatics and Decision Making, 9(1):41, 2009.

[25] L. Sweeney. k-anonymity: a model for protecting privacy. Int. J. Uncertain.
Fuzziness Knowl.-Based Syst., 10(5):557-570, Oct. 2002.

[26] E. Ukkonen. Approximate string-matching with q-grams and maximal matches.
Theor. Comput. Sci., 92(1):191-211, Jan. 1992.

[27] C. Van Rijsbergen. Information retrieval. Butterworths, 1979.

[28] W. Winkler. Overview of record linkage and current research directions.
Technical Report Statistics #2006-2, Statistical Research Division, U.S. Bureau
of the Census, 2006.

[29] X. Yang, B. Wang, and C. Li. Cost-based variable-length-gram selection for
string collections to support approximate queries efficiently. In SIGMOD
Conference, pages 353-364, 2008.

[30] A. C. Yao. Protocols for secure computations. In Proceedings of the 23rd
Annual Symposium on Foundations of Computer Science, SFCS '82, pages
160-164, Washington, DC, USA, 1982. IEEE Computer Society.

[31] A. C.-C. Yao. How to generate and exchange secrets. In Foundations of
Computer Science, 1986., 27th Annual Symposium on, pages 162-167, Oct.
1986.



